This application is based on Japanese Patent
Application Nos. 2000-212158 filed July 13, 2000 and
2000-380781 filed December 14, 2000, the contents of which
are incorporated herein by reference.

BACKGROUND OF THE INVENTION 
FIELD OF THE INVENTION 



The present invention relates to an apparatus, a method,
and a recording medium generally applied in speech
recognition.

DESCRIPTION OF THE RELATED ART



One reason why oral communication is an excellent means
for human beings to exchange information is that a listener
can help a speaker's speech act or concept forming. In
human speech dialog, therefore, even when the speaker
stumbles in his speech, the listener may guess what the
speaker intends to say and suggest some candidates that
help the speaker remember what he intended to say. For
example, when the speaker cannot remember the word "speech
complementing" and stumbles (hesitates), saying "speech,
er...", the listener can help the speaker by asking whether
he intended to say "speech complementing?". In this
process, the listener presents a candidate for the word the
speaker intended to say by complementing the fragment of
the word the speaker has uttered. Hence, this process may
be regarded as word complementing.

The concept of complementing has been widely applied to
text interfaces. For example, several text editors (e.g.,
Emacs and Mule) and UNIX shells (e.g., tcsh and bash)
provide a complementing function (called "completion") for
file names and command names. In such a function, when the
user presses a key (typically the Tab key) to call the
complementing function (hereinafter referred to as the
"complementing trigger key"), the remaining portion of a
word fragment that has been typed halfway is complemented.
In WWW browsers such as Netscape Communicator and Internet
Explorer, an automatic complementing function (called
"autocompletion") for URLs has also been introduced,
wherein the system provides lists of complementing
candidates one after another while the user is typing.

Recently, the complementing function has been introduced
into pen-based interfaces. For example, interfaces with
automatic complementing functions such as a predictive
pen-input interface and POBox have been proposed. (As for
the predictive pen-input, refer to Toshikazu FUKUSHIMA and
Hiroshi YAMADA, "A Predictive Pen-Based Japanese Text Input
Method and Its Evaluation", Transactions of Information
Processing Society of Japan, Vol. 37, No. 1, pp. 23-30
(1996); for POBox, refer to Masui, T., "An Efficient Text
Input Method for Pen-based Computers," Proceedings of the
ACM Conference on Human Factors in Computing Systems (CHI
'98), pp. 328-335, 1998).

For a speech input interface, however, the speech 
complementing input has not been realized because there 
has been no appropriate means for calling the complementing 
function while the speech is being inputted. 

SUMMARY OF THE INVENTION 



An object of the present invention is to provide a speech
complementing apparatus, a method, and a recording medium
that can complement inputted speech.

The present invention makes it possible to provide a better
speech input interface, mainly operated by speech
recognition, by introducing a speech input interface
function (hereinafter referred to as "speech complementing"
or "speech completion") that enables a system to complement
an uttered speech even when a user speaks only fragments of
words, without all the information that the user intends
to input, while the speech is being inputted to the system.

In order to realize speech complementing, there are two
possible methods, as in the case of text complementing.
One is complementing by a complementing trigger key, and
the other is automatic complementing, wherein complementing
candidates are presented in succession during the user's
utterance. However, in an attempt to complement speech
automatically, it is unlikely that the system can present
appropriate candidates in succession with the same accuracy
as in text complementing, because a fragment of speech is
very ambiguous for the system to recognize. Therefore, it
is quite likely that the automatic complementing function
itself would become excessively confused. Hence, automatic
complementing does not seem to be applicable to speech
complementing, and it becomes important that the
complementing function can be called intentionally and
effortlessly by the user when the user wants to see the
complementary candidates. The key to realizing an
easy-to-use speech complementing function lies in how the
complementing function can be called, in other words, what
kind of complementing trigger key should be used in the
speech complementing application.

According to the present invention, the user can call the
complementing function, when desired, without any
particular effort, because the filled pause, which is a
phenomenon of stumbling (one of the hesitation phenomena),
is assigned the role of the complementing trigger key.
Filled pauses are classified into two groups: fillers
(transition words) and word lengthening (prolongation of
syllables). Here, filled pauses mean the prolongation of
vowels during both fillers and word lengthening. Fillers
in Japanese such as "eh...", "uhm...", "ah...", "maa...",
"nnn...", "ano-...", "sono-...", "kono-...", etc., and
fillers in English such as "er...", "erm...", "uh...",
"um...", etc. include filled pauses. It is quite natural
behavior for human beings to utter filled pauses during
speech input, so the filled pause can be utilized as the
complementing trigger key. In fact, the filled pause plays
a similar role in human conversation. A speaker often
stalls for time with a filled pause to remember the next
word, or sometimes utilizes a filled pause expecting help
from the listener.

The above and other objects, effects, features and
advantages of the present invention will become more
apparent from the following description of embodiments
thereof taken in conjunction with the accompanying
drawings.



BRIEF DESCRIPTION OF THE DRAWINGS 



Fig. 1 is a flowchart that illustrates the speech
complementing operation;

Fig. 2 shows a screen during the input of "utada-" ("-"
indicates a filled pause);

Fig. 3 shows the screen during the filled pause of "da-";

Fig. 4 shows a screen displaying the complementary
candidates;

Fig. 5 shows a screen immediately after "No. 1" is
inputted;

Fig. 6 shows a screen wherein the "No. 1" candidate is
highlighted;

Fig. 7 shows a screen wherein the "No. 1" candidate is
confirmed;

Fig. 8 is a block diagram that illustrates the system
configuration of the speech complementing apparatus
according to the present invention;

Fig. 9 is a flowchart that illustrates the speech
complementing program; and

Fig. 10 is an illustration for explaining the speech
complementing.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS



Two embodiments according to the present invention are
described below.

(1) Speech Forward Complementing: Speech Complementing
Method Using a Filled Pause

In the first method, the filled pause, which is a
phenomenon of stumbling (one of the hesitation phenomena),
is assigned the role of the complementing trigger key, so
that the user can call the complementing function, when
desired, without any particular effort. When the phrase
"onsei hokan" (in English, "speech complementing") is
registered in the dictionary of speech recognition, for
example, a filled pause such as prolongation of the vowel
"i" as in "onsei-" ("-" indicates a filled pause), or the
utterance of the filler (transition word) "eh..." (in
English, "er...") as in "onsei, eh..." (in English,
"speech, er..."), causes the system to display the
complemented phrase "onsei hokan" (in English, "speech
complementing"). When there are a plurality of
complementary candidates, the system displays the
candidates on the screen or produces a synthesized audio
response so that the user can select an appropriate
candidate. When there is only one candidate, the system
may request the user to confirm it or automatically
complete the input.
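The forward complementing just described can be pictured, at the level of character strings, as a prefix search over the registered words. The following is only a minimal sketch under that assumption: the function names are hypothetical, plain text stands in for recognized speech, and "onsei ninshiki" is an extra entry invented for the example so that more than one candidate appears.

```python
from typing import List

# Illustrative dictionary. "onsei hokan" and "utada hikaru" come from the text;
# "onsei ninshiki" is a hypothetical extra entry added for the example.
WORD_DICTIONARY = ["onsei hokan", "onsei ninshiki", "utada hikaru"]


def forward_complete(fragment: str, dictionary: List[str]) -> List[str]:
    """Return the entries that begin with the fragment uttered before the filled pause."""
    prefix = fragment.strip().lower()
    return [word for word in dictionary if word.lower().startswith(prefix)]


def numbered(candidates: List[str]) -> str:
    """Format candidates as a numbered list for display or voice output."""
    return " ".join(f"{i}. {word}" for i, word in enumerate(candidates, start=1))
```

With this sketch, `forward_complete("onsei", WORD_DICTIONARY)` yields two candidates to choose from, whereas a fragment that matches a single entry could be confirmed or completed automatically.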
(2) Speech Backward Complementing: Complementing Method
Using Speech Wild Card

In the second method, when the user utters a specific wild
card key-word such as "nantoka-..." (in English, "so and
so..." or "something...") while intentionally producing a
filled pause, the system assumes that the whole key-word is
a wild card (an arbitrary character string) and complements
the wild card portion by determining it from the context.

For example, when the phrase "speech complementing" is
registered in the dictionary of a speech recognition
system, if the user utters the key-word with a filled
pause, as in "so and so... complementing," a list of
complementary candidates in which the part "so and so..."
is replaced with appropriate character strings, for
example, "speech complementing," "speech forward
complementing," and "speech backward complementing," is
displayed for the user's selection. When there are a
plurality of complementary candidates, the system may
display the candidates on the screen or produce a
synthesized voice response, and the user can select an
appropriate candidate from among them. When there is only
one candidate, the system may request the user to confirm
it or automatically complete the input.
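As a rough illustration of the wild card behavior, the sketch below strips a leading wild card key-word from a text transcript and then looks for dictionary entries that end with what was uttered afterwards. It is an assumption-laden simplification: the function names are hypothetical, the key-words are matched as plain string prefixes, and "speech recognition" is an extra entry invented to show a non-matching word.

```python
from typing import List, Optional

# Wild card key-words quoted in the text; treating them as plain string
# prefixes of a transcript is an assumption of this sketch.
WILDCARD_KEYWORDS = ("so and so", "something", "nantoka")


def split_wildcard(utterance: str) -> Optional[str]:
    """Return the part uttered after a wild card key-word, or None if there is none."""
    text = utterance.strip().lower()
    for keyword in WILDCARD_KEYWORDS:
        if text.startswith(keyword):
            # Drop the marks that stand for the filled pause in the transcript.
            return text[len(keyword):].lstrip(" .-")
    return None


def backward_complete(tail: str, dictionary: List[str]) -> List[str]:
    """Return the entries whose ending matches the uttered tail."""
    suffix = tail.strip().lower()
    return [word for word in dictionary if word.lower().endswith(suffix)]
```

For the utterance "so and so... complementing", `split_wildcard` returns "complementing", and `backward_complete` then keeps the three entries that end with it.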

While speech complementing can be performed at various
levels including word, phrase, and sentence, speech
complementing for words using the filled pause method is
explained hereunder. A word here is defined as that which
is registered in the word dictionary (of a language model)
of the speech recognition system; a phrase can also be
registered as a single word in the word dictionary.
Therefore, when a combination of family name and given name
such as "utada hikaru" is registered as a word, the filled
pause after the syllable "da" of "utada" generates
candidates including "utada hikaru".

Referring to Fig. 1, the user can input a word by speech
complementing that uses a filled pause, as described below.

1. When the user prolongs a vowel halfway during the
utterance of a word, a list of complementary candidates
(words) beginning with the portion that has been uttered is
immediately displayed with numbers.

Referring to Figures 2 and 3, for example, when the user
inputs "utada-," complementary candidates are displayed as
shown in Fig. 4:

1. utada hikaru 2. uehara takako 3. MR. DYNAMITE

2. When there are so many candidates that not all of them
can be displayed on the screen, the mark "Next candidates"
is displayed. In such a case, the user can see other
candidates by uttering "next" or "next candidates". If
there is no appropriate candidate or the user wants to
input another word, the user can proceed to another
utterance without making the selection described in
procedure 3 below.

3. The user can select one of the candidates while looking
at the list by one of the following four methods.

(a) The user selects a candidate by uttering its number, as
shown in Fig. 5 (for example, by saying "number one" or
"one").

(b) The user selects a candidate by uttering the remaining
part of the word (for example, by saying "hikaru").

(c) The user selects a candidate by uttering the whole word
(for example, by saying "utada hikaru").

(d) The user selects a candidate using some other device
such as a keyboard, mouse, or touch panel.

When a candidate is selected, it is highlighted as shown in
Fig. 6, and it is confirmed as the result of the speech
recognition as shown in Fig. 7.
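The spoken selection methods (a) to (c) above can be sketched as a small resolver that maps the recognized reply to one of the displayed candidates. This is a hypothetical illustration only: `select_candidate` and `NUMBER_WORDS` are names made up for the example, the reply is assumed to be already available as text, and method (d) (keyboard, mouse, or touch panel) is outside its scope.

```python
from typing import List, Optional

# Spoken numbers accepted by this sketch (an illustrative, incomplete set).
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def select_candidate(reply: str, candidates: List[str], uttered: str) -> Optional[str]:
    """Map the user's reply to a candidate, or return None if nothing matches.

    reply   -- recognized text of the utterance made after the list is shown
    uttered -- the fragment spoken before the filled pause, e.g. "utada"
    """
    text = reply.strip().lower()
    fragment = uttered.strip().lower()
    if not text:
        return None

    # (a) A candidate number such as "number one", "one", or "1".
    token = text[len("number "):] if text.startswith("number ") else text
    index = NUMBER_WORDS.get(token) or (int(token) if token.isdigit() else 0)
    if 1 <= index <= len(candidates):
        return candidates[index - 1]

    for candidate in candidates:
        lowered = candidate.lower()
        # (c) The whole word.
        if text == lowered:
            return candidate
        # (b) The remaining part of the word after the uttered fragment.
        if lowered.startswith(fragment) and text == lowered[len(fragment):].strip():
            return candidate
    return None
```

A reply that matches none of the three methods, such as "next", returns `None`, which leaves room for the interface to treat it as a command or as a new utterance.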

In speech complementing, it is possible to call the
candidates repeatedly during the input of a word. For
example, when inputting "Southern All Stars," the user can
display the candidate list after saying "Southern-...",
then display a narrowed-down list after saying "All-...",
and finally confirm by saying "Stars". As this example
shows, the system must be arranged so that the
complementary candidates are not called when the ordinary
long vowel of "All" is uttered, but only when an
intentional filled pause is made.

A speech input interface system capable of speech
complementing was actually constructed and operated, and it
has been confirmed that speech complementing functions in
practice, enabling the user to perform interactive speech
input by calling the complementary candidates. It has been
shown that the speech complementing function requires no
special training and that the interface is easy to use with
intuitive operation. Speech complementing was particularly
effective when inputting long words and phrases.

Although it has been confirmed in the test operation that
the function is especially effective in inputting proper
nouns such as names of songs and artists or addresses, it
is also applicable immediately to other voice input
applications for various systems.

According to the present invention, firstly, the function
helps the user recollect. Even when the user wants to
input something that he does not remember clearly, he can
input it with the system's help by uttering as much of the
word or words as he remembers.

According to the present invention, secondly, when the
phrase being inputted is long and complicated, the user has
only to utter a portion of the phrase that is enough for
the system to identify its content, so that the system can
complement and input the remainder.

Additionally, according to the present invention, the
speech complementing system causes less mental resistance
to use, because input can be made by uttering a portion of
a word, phrase, or sentence, whereas most conventional
voice interfaces require the user to carefully utter every
sound to the end.

A preferred embodiment of the speech complementing
apparatus based on the speech complementing method
according to the present invention is described hereunder.

Referring to Fig. 8, the speech complementing apparatus
comprises a CPU 10, a system memory 20, an input/output
(I/O) interface 30, an input device 40, a display 50, and a
hard disk (HD) 60. An information processing device
capable of executing programs, such as a personal computer,
may be used as the speech complementing apparatus.

The CPU 10 executes the program described below, which has
been loaded into the system memory 20, to provide the
speech complementing function. The system memory 20 has a
RAM to store the program to be executed as well as to
temporarily store the input/output data of the CPU 10.

The I/O interface 30, which is connected to a microphone
(not shown), transmits the speech inputted from the
microphone to the CPU 10. The input device 40, which has a
mouse, a keyboard, or a touch panel, instructs the
operation of the CPU 10. The display 50 displays the
information inputted from the input device 40 and the
results of the speech recognition processing executed by
the CPU 10. The character strings complemented by the CPU
10 are also displayed. Furthermore, when there are a
plurality of character strings that may be complemented,
the plural complementary character strings are displayed
so that the user can make a selection through the I/O
interface 30 or the input device 40.

The hard disk 60 stores a speech recognition program with a
speech complementing function, word dictionaries, a
dictionary for operating an interface, data that are used
by these programs for display, and other various data.

As the speech recognition program, an equivalent of a
commercially available product can be used. However, it
needs to be extended by adding the following complementing
function.

When a filled pause period in the inputted speech is
detected in the speech recognition processing, a list of
complementary candidates is made. The processing for
making the list of complementary candidates is realized by
extending a conventional continuous speech recognition
program as described below. The extension must not
adversely affect the recognition of normal speech that
does not include a filled pause. Though complementing of
a single-word utterance is described below, complementing
words and phrases in continuous speech input is also
possible in the same framework. The system uses a word
dictionary of the items to be inputted (persons' names,
etc.), a dictionary of wild card key-words, and a
dictionary for operating an interface (candidate numbers,
instructions for displaying other candidates, etc.). The
word dictionary has a tree structure as shown in Fig. 10.
The recognition processing starts from the root of the
dictionary and increases hypotheses corresponding to the
branching frame by frame, tracing the nodes toward the
leaves. The wedge marks represent hypotheses. When the
filled pause is detected, whether or not the hypothesis
with the highest likelihood at this point is a wild card
key-word is judged, to determine which of speech forward
complementing and speech backward complementing is
executed.

In the case of speech forward complementing, the
generation of complementary candidates is realized by
tracing from the effective top-N_seed hypotheses at this
point (the N_seed hypotheses from the top in order of
likelihood) to the leaves. Those candidates are numbered
in order of likelihood to obtain the top-N_choice
candidates.

The nodes corresponding to the hypotheses used in the
generation are referred to as seeds. For example, assuming
that the uppermost black circle is a seed, the
complementary candidates are "Blankey jet city" and "Black
flys". At the same time, how far each candidate has been
uttered is checked by obtaining the phoneme string that has
been recognized. In order to make it possible to select a
candidate by uttering the remaining part of a word after
the user has looked at the candidates, an entry node table
is introduced for registering the roots from which
recognition starts. This enables recognition from the
middle of a word. For normal recognition starting from
the top of a word, only the root of the word dictionary is
registered. When recognition from the middle of a word
should be enabled for candidate selection, the seed of each
complementary candidate is temporarily added as a root
(only for the next utterance after the utterance with the
filled pause). Although added entries are recognized by
uttering only the remaining phoneme string after the filled
pause, the result of the recognition is the whole word.
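The tree-structured dictionary, the seed, and the entry node table can be illustrated with a small trie. The sketch below is a simplification under stated assumptions: characters stand in for phonemes, there are no likelihood scores (candidates come out in alphabetical order instead of likelihood order), a single seed is used instead of the top-N_seed hypotheses, and all names (`Node`, `build_tree`, `walk`, `leaves`, `recognize`) are hypothetical.

```python
from typing import Dict, List, Optional


class Node:
    """One node of the tree-structured word dictionary."""

    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.word: Optional[str] = None  # set where a dictionary entry ends


def build_tree(words: List[str]) -> Node:
    """Build the tree; each character stands in for one phoneme."""
    root = Node()
    for word in words:
        node = root
        for symbol in word.lower():
            node = node.children.setdefault(symbol, Node())
        node.word = word
    return root


def walk(node: Optional[Node], symbols: str) -> Optional[Node]:
    """Follow the symbols from the given node; None if the path does not exist."""
    for symbol in symbols.lower():
        if node is None:
            return None
        node = node.children.get(symbol)
    return node


def leaves(seed: Node) -> List[str]:
    """Trace from a seed toward the leaves and collect the whole words found."""
    found = [seed.word] if seed.word is not None else []
    for symbol in sorted(seed.children):
        found.extend(leaves(seed.children[symbol]))
    return found


def recognize(utterance: str, entry_nodes: List[Node]) -> Optional[str]:
    """Try each registered entry node; the result is always a whole word."""
    for entry in entry_nodes:
        node = walk(entry, utterance)
        if node is not None and node.word is not None:
            return node.word
    return None
```

In this sketch the entry node table is just the list passed to `recognize`: normally `[root]`, and `[root, seed]` for the one utterance that follows a filled pause, so that saying only the remaining part of a word still returns the whole word.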

On the other hand, in the case of speech backward
complementing, the latter part of a word, uttered after the
end point of the wild card key-word with the filled pause,
is recognized so as to generate candidates. This
recognition from the middle of the word is realized by
temporarily adding, to the entry node table, the syllables
in the middle of all the words in the word dictionary (only
just after a wild card key-word). Then the hypotheses that
have reached leaves are numbered in order of likelihood,
and the N_choice hypotheses from the top are sent as
complementary candidates. After that, in order to make it
possible to select a candidate by uttering the remaining
first-half part of the word, a word whose leaf is the end
of the unuttered phoneme string of each candidate is
temporarily registered. For example, when "koyanagi yuki"
is inputted in the form "something... yuki" (or "so and
so... yuki"), a word whose leaf is the end of /koyanagi/
is temporarily added to the word dictionary.
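The temporary registration step for backward complementing might be sketched as follows. Again this is an illustration under assumptions rather than the recognizer itself: characters stand in for syllables and phonemes, every position inside a word is treated as a temporary entry node, likelihood ranking is omitted, and the names `backward_candidates` and `temporary_entries` are hypothetical.

```python
from typing import Dict, List, Tuple


def backward_candidates(tail: str, dictionary: List[str]) -> List[Tuple[str, str]]:
    """Return (whole word, unuttered front part) for each word ending with the tail."""
    suffix = tail.strip().lower()
    hits = []
    for word in dictionary:
        lowered = word.lower()
        # Each position inside the word stands in for a temporary entry node.
        for start in range(1, len(lowered)):
            if lowered[start:] == suffix:
                hits.append((word, word[:start].strip()))
                break
    return hits


def temporary_entries(hits: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each unuttered front part to its whole word, for the next utterance only."""
    return {front.lower(): word for word, front in hits}
```

After "something... yuki", the temporary table would map "koyanagi" to "koyanagi yuki", so uttering only the first half selects the whole word.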

Referring to Fig. 9, the operation of the speech
complementing apparatus is described as follows. Fig. 9
shows the content of the speech recognition program. The
speech recognition program is loaded into the system memory
20 from the hard disk 60 according to an instruction from
the input device 40, and is then executed by the CPU 10.
When speech is inputted from the microphone (not shown)
through the I/O interface 30, the CPU 10 temporarily stores
the speech in the form of a digital signal in the system
memory 20 (step S10). At this time, the CPU 10 checks the
inputted speech to detect a period of the filled pause
(prolongation of vowels and a continuous voiced sound) by
using a method disclosed in Japanese Patent Application No.
11-305768 (1999).
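The detection method itself is the one disclosed in the cited application and is not reproduced here. Purely to illustrate the idea that a filled pause is a prolonged vowel, and that an ordinary long vowel should not trigger complementing, the toy sketch below scans a sequence of frame-level labels for a vowel that persists for at least a threshold number of frames. The label sequence, the `min_frames` threshold, and the function name are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

VOWELS = set("aeiou")


def find_filled_pause(
    frame_labels: List[str], min_frames: int = 20
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the first vowel run of at least
    min_frames frames, or None. A toy stand-in, not the cited method."""
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # A run ends at the end of input or when the label changes.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            if frame_labels[start] in VOWELS and i - start >= min_frames:
                return (start, i)
            start = i
    return None
```

Under these assumptions a vowel held for 30 frames is reported as a filled pause, while a vowel held for only 10 frames is not.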
When the period of the filled pause is detected, speech
recognition and complementing of the speech data stored in
the system memory 20 are executed by using the speech
recognition program (step S30). The result of speech
recognition is obtained by generating a number of
hypotheses of phoneme strings based on the word dictionary
and the grammar, and by evaluating the likelihood of those
hypotheses to obtain appropriate results from the top in
order of likelihood.

In parallel with such processing, the CPU 10 constantly
judges whether or not there is a filled pause. When no
filled pause is detected, the process proceeds along steps
S30 and S70, and the result of the speech recognition
obtained is outputted.

When a filled pause is detected, the complementary
candidates are displayed on the screen of the display 50.
The user selects a suitable one from among the
complementary candidates by using speech, the mouse, or the
touch panel (step S60). When there is only one candidate,
the system may wait for the user to confirm the selection,
or automatic selection (decision) may be allowed. The
selected candidate is decided as the final result of
complementing. The result of complementing is also
outputted as the result of the speech recognition (step
S70).
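The control flow of Fig. 9 as described above can be summarized in a few lines. The sketch is hypothetical and heavily simplified: text stands in for speech, exact dictionary lookup stands in for recognition, prefix matching stands in for forward complementing, the filled pause flag is supplied by the caller, and `choose` is an assumed callback representing the user's selection in step S60.

```python
from typing import Callable, List, Optional


def process_utterance(
    text: str,
    filled_pause: bool,
    dictionary: List[str],
    choose: Callable[[List[str]], str],
) -> Optional[str]:
    """Return the final recognition result for one utterance, or None."""
    buffered = text.strip().lower()  # S10: store the inputted speech

    if not filled_pause:
        # No filled pause: ordinary recognition (S30), then output (S70).
        matches = [w for w in dictionary if w.lower() == buffered]
        return matches[0] if matches else None

    # Filled pause detected: recognition and complementing (S30).
    candidates = [w for w in dictionary if w.lower().startswith(buffered)]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]  # automatic decision; confirmation could be asked instead
    return choose(candidates)  # S60: user selection, then output (S70)
```

Passing the selection step in as a callback keeps the sketch independent of whether the choice is made by speech, mouse, or touch panel.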

In addition to the embodiment described above, the
following embodiments are also possible:

1) The output units for the complemented speech recognition
results can include, in addition to the display, a printer,
a speaker for synthesized voice output, a communication
device to other computers, and a drive that writes
information to a recording medium such as a floppy disk.

When synthesized voice is output from the speaker, the
voice may be synthesized from the character strings
resulting from the speech recognition using a conventional
voice-synthesizing program.

Furthermore, when there are a plurality of complementary
character strings, the candidates for selection may be
announced to the user using the synthesized voice. In such
a case, the user may input the selection by speech input as
well as by the input device 40. When speech input is used,
the CPU 10 recognizes the input speech and identifies the
complementary character string that has been selected by
the user based on the speech recognition results.

2) The speech complementing apparatus may utilize any
equipment such as an IC chip, a cellular phone, a personal
computer, and other information processing devices.

The present invention has been described in detail with
respect to preferred embodiments, and it will now be
apparent from the foregoing to those skilled in the art
that changes and modifications may be made without
departing from the invention in its broader aspects, and it
is the intention, therefore, in the appended claims to
cover all such changes and modifications as fall within the
true spirit of the invention.